The American Journal of Human Genetics
Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match The American Journal of Human Genetics's content profile, based on 206 papers previously published here. The average preprint has a 0.20% match score for this journal, so anything above that is already an above-average fit.
Wang, H.; Wainschtein, P.; Sidorenko, J.; Fikere, M.; Zhang, Y.; Kemper, K. E.; Zheng, Z.; Hivert, V.; Zeng, J.; Goddard, M. E.; Visscher, P. M.; Yengo, L.
Assessing the contribution of ultra-rare variants (minor allele frequency <0.01%) to the heritability of complex traits remains challenging due to limited understanding of potential biases. Here, we focus on singletons (that is, variants observed only once in the study sample), the most abundant class of ultra-rare variants, to showcase various confounders of heritability estimates and underline pitfalls in their interpretation. We show through theory, simulations, and analysis of 5,330,210 exome-sequenced singletons in 305,813 unrelated European-ancestry individuals in the UK Biobank that (i) population stratification induces both upward and downward biases in singleton-based heritability estimates, (ii) estimates capture non-additive genetic effects, and (iii) asymptotic standard errors of estimates from likelihood-based procedures are generally mis-calibrated when traits are not normally distributed. We further showcase these biases in real-data analyses of 22 quantitative phenotypes and report, after accounting for these pitfalls, significant estimates for number of children (3.4%), peak expiratory flow (1.9%), red blood cell count (2.5%), white blood cell count (1.9%) and heel bone mineral density (2.4%). Overall, our study provides recommendations for robust inference of heritability from ultra-rare variants and underscores that reliable estimates for ordinal and binary traits will require far larger sample sizes and improved methods, given that confounding in these traits remains difficult to detect and correct.
Konovalov, F. A.
Allele count data from affected individuals and population controls are central to variant interpretation, yet their evidential meaning is often mediated by discrete thresholds and implicit assumptions. This work introduces a fully quantitative Bayesian framework for dominant rare disease genetics in which all allele count evidence is summarized by a single quantity, the Bayes factor, that evaluates the probability of observing the same data under two explicitly defined competing models. Rather than replacing individual ACMG/AMP pathogenicity criteria, the Bayes factor provides a unified measure that naturally incorporates evidence in both the pathogenic and benign directions. The framework accounts for variation in affected cohort size, penetrance, disease prevalence, and assay error rates, allowing these biologically and technically meaningful quantities to be specified directly instead of absorbed into fixed cutoffs. Application to a non-Finnish European population shows that the dependence of the Bayes factor on observed allele counts is strongly shaped by how the affected cohort is defined and by false positive rates in control datasets. Across representative scenarios, Bayes factor values are broadly compatible with established allele count criteria combinations expressed on odds-ratio scales under typical parameterizations, while remaining tunable beyond these defaults.
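As a rough illustration of the idea in the abstract above, the sketch below computes a log Bayes factor for carrier counts in an affected cohort under a deliberately simplified binomial model. It is not the author's model: the function name, the parameter values, and the assumption that the carrier frequency among affected individuals equals penetrance times population carrier frequency divided by prevalence (Bayes' rule for a dominant variant, with assay error ignored) are all illustrative.

```python
import math


def log_bayes_factor(carriers, cohort_size, carrier_freq, penetrance, prevalence):
    """Log Bayes factor (pathogenic vs. benign) for carrier counts in cases.

    Toy model, assumed for illustration only:
      benign:     cases carry the variant at the population carrier frequency
      pathogenic: P(carrier | affected) = penetrance * carrier_freq / prevalence
    """
    q_benign = carrier_freq
    q_pathogenic = penetrance * carrier_freq / prevalence
    if not 0.0 < q_benign < 1.0 or not 0.0 < q_pathogenic < 1.0:
        raise ValueError("carrier frequencies must lie strictly between 0 and 1")
    non_carriers = cohort_size - carriers
    # The binomial coefficient is identical under both models and cancels.
    return (carriers * math.log(q_pathogenic / q_benign)
            + non_carriers * math.log((1.0 - q_pathogenic) / (1.0 - q_benign)))


# Hypothetical inputs: 1,000 affected individuals, carrier frequency 1e-5,
# penetrance 0.5, disease prevalence 1e-3.
settings = dict(cohort_size=1000, carrier_freq=1e-5, penetrance=0.5, prevalence=1e-3)
log_bf_three = log_bayes_factor(carriers=3, **settings)  # positive: favors pathogenic
log_bf_zero = log_bayes_factor(carriers=0, **settings)   # negative: favors benign
```

With these made-up inputs, three carriers push the evidence toward the pathogenic model while zero carriers push it toward the benign one, which mirrors the abstract's point that a single quantity can carry evidence in both directions.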
Gibbs, P. M.; Beasley, I. J.; Del Azodi, C. B.; McCarthy, D. J.; Gallego Romero, I.
The phenotypic effects of germline variants are often mediated through gene regulation. Expression quantitative trait loci (eQTLs) are genetic variants associated with changes in gene expression. Understanding how eQTLs vary across populations is essential for characterising the genetic and regulatory drivers of trait diversity. Meta-analysing eQTL studies from multiple populations enables more robust detection of eQTLs and can reveal regulatory mechanisms shaped by population-specific environmental or ancestry-related factors. However, across the multi-ancestry eQTL literature, a wide range of methods have been used to quantify eQTL portability across ancestry groups. Because different studies employ different portability metrics, it is challenging to form a coherent view of the regulatory landscape across populations. In this work, we analyse eQTL summary statistics from ten datasets matched on tissue type and sequencing technology. We compare portability metrics used previously and show that they can yield markedly different patterns of apparent regulatory conservation or divergence. We then examine the statistical determinants of portability across metrics and demonstrate that sample size, minor allele frequency, and linkage disequilibrium are major drivers of the observed differences in eQTL portability across studies. These findings highlight that differences in statistical power stemming from factors such as population size and allele frequency must be accounted for when evaluating eQTL portability. To address this issue, we introduce a new approach designed to correct for these factors when calling eQTL portability. Finally, we show that empirical Bayes multivariate adaptive shrinkage provides a powerful framework for meta-analysing multiple eQTL studies, with the ability to pool signals across populations to produce more robust effect-size estimates within each population.
Aguirre, M.; Irudayanathan, F. J.; Crow, M.; Hejase, H. A.; Menon, V. K.; Pendergrass, R. K.; McCarthy, M. I.; Fletez-Brant, K.
Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods -- CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA -- using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8-5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
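To make the distributional view of calibration concrete, here is a minimal sketch that measures how far a set of p-values sits from the uniform distribution expected under the null, using the 1-Wasserstein distance. It is one assumed way of setting up such a comparison, not the authors' framework; the function name and the synthetic p-values are hypothetical.

```python
def wasserstein_to_uniform(p_values):
    """1-Wasserstein distance between observed p-values and a uniform grid.

    For two equal-sized, equally weighted samples on the real line, the
    distance is the mean absolute difference between their sorted values.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    observed = sorted(p_values)
    # Midpoint quantiles of Uniform(0, 1), used as the reference sample.
    expected = [(i + 0.5) / n for i in range(n)]
    return sum(abs(o - e) for o, e in zip(observed, expected)) / n


n_tests = 1000
calibrated = [(i + 0.5) / n_tests for i in range(n_tests)]
inflated = [p ** 2 for p in calibrated]  # synthetic p-values skewed toward zero

d_calibrated = wasserstein_to_uniform(calibrated)
d_inflated = wasserstein_to_uniform(inflated)
```

A perfectly uniform set gives a distance of zero, and the skewed set gives a clearly positive distance, so larger values indicate a larger departure from null calibration in this toy setup.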
Dudek, M. F.; Wenz, B. M.; Voight, B. F.; Almasy, L.; Grant, S. F. A.
The vast majority of trait-associated loci discovered through genome-wide association studies (GWAS) are non-coding, yet most lack statistical alignment with any discovered expression quantitative trait loci (eQTLs). In particular, eQTLs are depleted at gene-distal regions and at "functionally important" genes - those with strong selective constraint and complex regulatory landscapes - likely due to selective depletion of high-effect variants. Here, we investigate the role of variants with weaker effects on expression transmitted through distal regulatory elements, which are detectable as chromatin accessibility QTLs (caQTLs). We aggregated caQTL data from ten studies derived across different tissues, cell-types and lines, representing 104,024 lead caQTLs across 3,457 samples. We found that, across a range of gene properties, caQTLs are discovered at functionally important genes more often than eQTLs. These observations are consistent with a model in which many eQTLs and GWAS hits are mediated through genetic effects on regulatory elements, which may have weak or context-dependent effects on gene expression. Our results suggest that caQTL discovery is more sensitive than eQTL discovery in capturing the molecular consequences of GWAS hits, and can provide complementary information to eQTLs by implicating functional mechanisms of additional disease-associated loci.
Wang, J.; Morrison, J.
Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causal relationships between complex traits. Standard MR can be used to estimate an average causal effect at the population level, and typically assumes a linear exposure-outcome relationship. Recently, several methods for estimating nonlinear effects have been developed. However, many have been found to produce spurious empirical findings when subjected to negative control analyses. We propose that this poor performance may be attributable to heterogeneity in variant-exposure associations. We demonstrate that heterogeneous genetic effects on exposure lead to biased estimates, poor coverage, and inflated type I error in control function and stratification-based methods. In contrast, two-stage least squares (TSLS) methods are robust to such heterogeneity, but suffer from low precision and low power in some circumstances. We show that a statistical test for heterogeneity can be used to guide the choice of nonlinear MR methods. Using UK Biobank data, we reassess the causal effects of BMI, vitamin D, and alcohol consumption on blood pressure, lipid, C-reactive protein, and age (negative control). We find strong evidence of heterogeneity for all three exposures, and also recapitulate previous results that control function and stratification-based methods are prone to false positives. Finally, using nonparametric TSLS, we identify evidence of nonlinear causal effects of BMI on HDL cholesterol, triglycerides, and C-reactive protein; however, specific estimates of the shape of these relationships are imprecise. Altogether, our results suggest that common nonlinear MR methods are unreliable in the presence of realistic levels of heterogeneity, and that more methodological development is required before practically useful nonlinear MR is feasible.
Sweeney, M. D.; Kang, H. M.
Recent advances in deep learning have led to the development of sequence-to-omics (S2O) models that predict molecular phenotypes directly from DNA sequences. Here, we systematically evaluate the utility of these models, e.g., AlphaGenome, Borzoi, Enformer, and Sei, for improving the reproducibility of genetic fine-mapping across expression quantitative trait loci (eQTL) datasets from Genotype-Tissue Expression (GTEx), Trans-Omics Precision Medicine (TOPMed), and Multi-Ancestry Analysis of Gene Expression (MAGE) projects. We show that purely statistical fine-mapping often yields high replication failure rates (RFRs), but integrating S2O model predictions substantially reduces RFRs and enhances the accuracy of prioritizing SNPs replicated in other consortia. We describe a generalized framework for functionally informed fine-mapping that combines traditional posterior inclusion probabilities (PIPs) from statistical fine-mapping methods with scores from S2O models to generate functionally informed PIPs (fiPIPs) that improve reproducibility. Our findings demonstrate that S2O models, particularly newer ones like AlphaGenome and Borzoi, enable robust identification of replicated variants across consortia, highlighting their promise for scalable, functionally aware genetic mapping.
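One simple way to picture how a functional score could reshape statistical PIPs is sketched below. It rests on assumptions not stated in the abstract: a single causal variant per locus, a uniform prior behind the original PIPs, and a non-negative S2O-derived score used as a relative prior weight. Under those assumptions the updated probability is proportional to score times PIP. The function name and the numbers are hypothetical, and the authors' actual fiPIP definition may differ.

```python
def functionally_informed_pips(pips, scores):
    """Reweight PIPs by non-negative functional scores and renormalize.

    Assumed model: one causal variant per locus and uniform-prior PIPs, so
    replacing the prior with weights proportional to `scores` gives updated
    probabilities proportional to score * PIP. The locus keeps its total
    posterior mass.
    """
    if len(pips) != len(scores):
        raise ValueError("pips and scores must have the same length")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    weighted = [p * s for p, s in zip(pips, scores)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("at least one variant needs a positive weighted PIP")
    mass = sum(pips)
    return [mass * w / total for w in weighted]


# Hypothetical locus with three variants; the second has the strongest
# predicted regulatory effect.
statistical_pips = [0.5, 0.3, 0.2]
s2o_scores = [1.0, 4.0, 1.0]
fi_pips = functionally_informed_pips(statistical_pips, s2o_scores)
```

In this toy locus the variant with the highest functional score overtakes the statistical lead variant, which is the kind of re-ranking that functionally informed fine-mapping aims for.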
Liu, Z.; Ramteke, A.; Anand, A.; Gorla, A.; Jeong, M.; Sankararaman, S.
It is increasingly recognized that genetic effects on complex traits and diseases are shaped by environmental context. Biobanks that measure diverse environmental exposures alongside genotypes and phenotypes at scale enable systematic study of gene-environment (GxE) interactions. Existing approaches, however, are limited in their ability to accurately model polygenic GxE involving many exposures across genome-wide genetic variants. It is unclear which exposure combinations are relevant for a given trait while distinguishing true interactions from environment-dependent heteroskedastic noise. To address these challenges, we develop Efficient multi-eNvironmental Gene-environment Interaction iNference Estimator (ENGINE), a supervised variance-component framework that learns an embedding that combines multiple environmental exposures while jointly estimating additive, GxE, and heteroskedastic noise components. To enable biobank-scale inference, ENGINE makes a single pass over the genotype matrix to cache genotype-dependent summaries, then assembles normal-equation components and gradients at each iteration. In simulations, ENGINE controls type I error rates, achieves high power, and accurately recovers the environmental embedding while remaining efficient at biobank scale. Applied to five complex traits paired with lifestyle exposures in N = 291,273 unrelated white British individuals and M = 454,207 common SNPs (MAF > 0.01) from the UK Biobank, ENGINE recovered GxE variance that was on average 1.4-fold larger than that captured by a single exposure and 5.5-fold larger than that captured by the first principal component of the exposures.
Mishra, S.; Patra, R. R.; Reddy, A. S.; Mandal, A.; Majumdar, A.
Genome-wide gene-environment (GxE) interaction studies have seen limited success in detecting reliable GxE signals. A standard genome-wide GxE scan assumes a single genetic mode of inheritance, such as an additive model. It can lead to reduced statistical power when the true genetic model is non-additive, such as a recessive model. We propose a robust GxE testing approach that uses Cauchy p-value aggregation. It combines the p-values from GxE tests based on the additive, dominant, and recessive genetic models. Using extensive simulation studies, we demonstrate that the p-value combination strategy offers a robust and powerful approach to identifying GxE interactions regardless of the underlying genetic model. The method is substantially more powerful than the additive model when the true genetic model is recessive. It is also more powerful than the general two-degree-of-freedom genotypic test for GxE interaction. We apply our approach to analyze GxE interactions in the UK Biobank data across several combinations of phenotypes and environmental factors. For glycated hemoglobin (HbA1c) level, treating cumulative smoking exposure as the lifestyle factor, our approach identified 82 independent GxE loci while controlling FDR at 5%. The GxE test based on the additive genetic model detected 24 loci. For type 2 diabetes with sleep duration as a lifestyle factor, the proposed approach detected 563 independent GxE loci at 5% FDR, substantially exceeding the number of discoveries by the other approaches.
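The Cauchy p-value aggregation mentioned in the abstract above can be sketched in a few lines. The version below uses the standard Cauchy combination rule: each p-value is mapped to a Cauchy variate, the variates are averaged, and the average is converted back to a p-value. Equal weights, the function name, and the three example p-values are assumptions for illustration; the authors' implementation may differ in weighting or numerical safeguards.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule.

    T = sum_i w_i * tan((0.5 - p_i) * pi), with weights summing to 1;
    the combined p-value is the standard Cauchy tail probability of T.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0 / k] * k  # equal weights (an assumption)
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, p_values))
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical GxE p-values for one SNP under three genetic models.
p_additive, p_dominant, p_recessive = 0.20, 0.04, 1e-6
p_combined = cauchy_combine([p_additive, p_dominant, p_recessive])
```

Because the combined statistic is dominated by the smallest p-value, a strong recessive signal survives aggregation even when the additive test is unremarkable, at the cost of a modest penalty for combining three tests.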
Romanescu, R.; Liu, M.
We consider the problem of optimal testing for genetic interaction between two variants, allowing for possible main effects. Finding a most powerful test is important because it ends a series of attempts in the literature to construct ever more powerful tests for interaction at the variant pair level. Testing under a logistic regression model is known to be underpowered, partly because patterns of enrichment in the genotypes themselves are lost when regarding genotypes solely as predictors. Instead, we use the retrospective likelihood approach, which makes use of all the data by treating genotypes as outcomes alongside affection status. Using a parsimonious parameterization of penetrance based on the risk ratio, which links directly to the population prevalence and avoids having to estimate an intercept term, we construct an approximate uniformly most powerful unbiased test for interaction. This test is based on optimal testing theory and accounts for nuisance main effects without requiring their explicit estimation. The test statistic can be easily modified for optimal testing under other modes of genetic interaction, such as recessive x recessive or dominant x dominant. In simulation studies, we demonstrate significant power gains compared to the odds-ratio-based PLINK test. Finally, we apply the test to scan for interactions in IBD cases and controls from the UK Biobank. The top SNP pairs show enrichment for a pathway related to existing therapies for IBD.
Singh, S. K.; Adelizzi, E.; Heffner, C.; Curtis, S.; Duncan, K.; Awotoye, W.; Olotu, J.; Busch, T.; Adeyemo, W.; Gowans, L. J. J.; Naicker, T.; Murray, S. A.; Butali, A.; Leslie-Clarkson, E. J.; Dunnwald, M.; Cornell, R. A.
The differentiation cascade that converts basal keratinocytes into suprabasal layers, including periderm, depends on the activity of transcription factors. Mutations in the genes encoding many of these transcription factors, including TP63, IRF6 and GRHL3, disrupt periderm development. Such mutations can also interfere with embryonic fusion and septation events that depend on periderm development, including palatogenesis, digit separation and the formation of temporary epithelial fusions between digits, between eyelids, and between pinnae and the scalp. ZNF750 (Zfp750 in the mouse) is a transcription factor required for keratinocyte differentiation, but whether mutations in ZNF750 contribute risk for orofacial cleft (OFC), and the role of Zfp750 in periderm development, are unknown. To address these questions we sequenced ZNF750 in 5,659 individuals including 2,125 with nonsyndromic OFC. We identify 33 rare missense variants with frequencies less than 0.1% in gnomAD. Of these, about half are predicted to be damaging with in silico tools. Collectively, these missense variants are not overtransmitted from parents to children with OFCs. Two of the variants have lower activity than the reference variant in a zebrafish embryo-based assay but no phenotype in the corresponding murine model. However, in murine embryos homozygous for a frame-shift mutation in Zfp750 (Zfp750fs) that we generated, palatal shelves are fused but intra-oral adhesions are present, a phenotype seen in murine mutants of several bona fide OFC genes. In addition, temporary epithelial fusions are absent in Zfp750fs neonates. RNA sequencing of forelimbs from Zfp750fs embryos reveals decreased expression of epidermal terminal differentiation genes, and both increased and decreased expression of distinct periderm genes. Immunofluorescence shows the consistent presence of periderm proteins within the oral adhesions in Zfp750fs/fs embryos. Together these studies suggest that while mutations in ZNF750 are not a major contributor to OFC risk, Zfp750 does contribute to periderm-dependent morphogenic events.
Hnizda, A.; Martinez-Delgado, B.; Sanchez-Ponce, D.; Alonso, J.; Amiel, J.; Attie-Bitach, T.; Bada-Navarro, A.; Baladron, B.; Bermejo-Sanchez, E.; Brinsa, V.; Bukova, I.; Cazorla-Calleja, R.; Cervenkova, S.; Chow, S.; Dusek, P.; Fedosieieva, O.; Fernandez-Prieto, M.; Ghosh, S.; Gomez-Mariano, G.; Gregorova, A.; Hamilton, M. J.; Hartmannova, H.; Hernandez-San Miguel, E.; Herrero-Matesanz, M.; Hodanova, K.; Kadek, A.; Kerkhof, J.; Kleefstra, T.; Lacombe, D.; Levy, M. A.; Lopez-Martin, E.; Lyse, R.; Man, P.; Marin-Reina, P.; Macnamara, E. F.; McConkey, H.; Melenovska, P.; Mielu, L. M.; Moore, D.;
The EHMT1 and EHMT2 genes encode human euchromatin histone lysine methyltransferase 1 and 2 (EHMT1 alias GLP; EHMT2 alias G9a) that form heteromeric GLP/G9a complexes with essential roles in epigenetic regulation of gene expression. While EHMT1 haploinsufficiency has been established as the cause of Kleefstra syndrome 1, the pathogenesis of G9a dysfunction in human disease remains largely unknown. We identified seven de novo EHMT2 variants in patients with clinical presentation, episignatures, histone modifications and transcriptomic profiles similar to those of Kleefstra syndrome 1. In vitro studies revealed that these variants encode structurally stable G9a proteins that are catalytically incompetent due to aberrant interactions either with the histone H3 tail or with S-adenosylmethionine. Heterozygous mice carrying a patient-derived variant exhibited growth retardation, facial/skull dysmorphia and aberrant behavior. Here we report pathogenic EHMT2 variants that likely exert a dominant-negative effect on GLP/G9a complexes and thus genocopy the EHMT1 haploinsufficiency via a distinct molecular mechanism, defining an autosomal dominant EHMT2-related Kleefstra syndrome.
Jacobsen, J. T.; Moller, P. L.; Rohde, P. D.
Genomics offers a powerful approach to identify causal mechanisms underlying coronary artery disease (CAD) risk, with implications for pathogenesis, personalized prevention strategies, and therapeutic target discovery. Functionality-informed genome-wide fine mapping was performed using the Bayesian framework SBayesRC to estimate genetic contributions of 6.9 million common variants, based on GWAS summary statistics from over one million individuals of European ancestry. Causal candidate genes were prioritized in a 5 kb flanking window within high-confidence local credible sets (LCSs). Their downstream biological influence was analyzed using protein-protein interaction networks and pathway enrichment analyses across three complementary dimensions: molecular, cellular, and disease level. Genetic modeling captured the highly polygenic architecture of CAD, estimating on average 34,000 variants to contribute to CAD risk, explaining 3.8% of total phenotypic variance. 36 high-confidence variants (PIP > 0.9) collectively explained 13.6% of genetic variance, while most variants demonstrated small individual effects but with substantial collective contributions. 17,150 variants were prioritized within 581 high-confidence LCSs, of which 195 were annotated to genes and 170 were implicated in downstream pathway analyses. The three most influential variants were mapped to PHACTR1, APOE, and LPL, explaining 2.49%, 1.59%, and 1.46% of genetic variance respectively. Pathway analyses revealed that genetic risk in CAD is driven by dysregulation of three interlinked biological processes: 1) lipoprotein function and cholesterol metabolism, 2) vascular homeostasis, and 3) cellular stress responses and inflammation. These findings advance the causal understanding of CAD pathogenesis, supporting the transition from association-based to functionality-informed genomic approaches in cardiovascular genetics.
Ravarani, C. N. J.; Arend, M.; Baukmann, H. A.; Cope, J. L.; Lamparter, M. R. J.; Sullivan, J. K.; Fudim, R.; Bender, A.; Malarstig, A.; Schmidt, M. F.
Human genetics has become a cornerstone of drug target discovery, yet the value of Mendelian randomization (MR) for predicting clinical success remains uncertain. Here, we systematically evaluated MR across 11,482 target-indication pairs with documented Phase II clinical outcomes to assess its utility for drug development. We find that MR statistical significance alone does not enrich for Phase II success, in contrast to genome-wide association study (GWAS) support, which confers an increase in success probability. However, this apparent limitation reflects the heterogeneous nature of clinical failure and the fact that MR encodes information beyond P values. When MR-derived features, including instrument strength and explained variance, are integrated into machine learning models, predictive performance improves substantially. An MR-informed XGBoost classifier identifies target-indication pairs with a 55% overall approval rate, corresponding to a 6.4-fold enrichment over unstratified programs and a 2.8-fold improvement over GWAS-supported targets in Phase II. Notably, this enrichment is achieved without reliance on statistically significant MR results. Our findings demonstrate that MR is most informative when treated as a graded, context-dependent source of causal evidence rather than a binary hypothesis test, and that its integration with machine learning enables scalable, genetics-informed prioritization of drug targets across the clinical pipeline.
Aziz, M. C.; Wilson, J.; Chow, C. Y.
PIGA-CDG is a congenital disorder of glycosylation caused by pathogenic partial loss-of-function variants in the PIGA gene. PIGA encodes an enzyme responsible for the catalytic transfer of N-acetylglucosamine to phosphatidylinositol during the first step of glycosylphosphatidylinositol anchor biosynthesis. Loss of this enzyme has a widespread phenotypic impact, but primarily results in neurological symptoms including seizures, intellectual disability, and developmental delay. Currently, treatments are limited and focus on symptom management. We developed an eye model of PIGA-CDG that has a reduced eye size. We screened a library of 98% 1,520 FDA/EMA-approved compounds to find drugs that improved the small eye phenotype. This screen revealed numerous drugs that improved eye size, including those that targeted dopamine signaling and cyclooxygenases. Using pharmacological and genetic approaches, we show that modulating dopamine signaling improves the eye size. Genetic inhibition of dopamine 2 receptor signaling and dopamine reuptake improves both the eye model and neurologically relevant PIGA-CDG phenotypes, including seizures and locomotor deficits. We also pharmacologically and genetically validate cyclooxygenase-targeting drugs in the eye model. These findings reveal novel biology underlying PIGA-CDG and point towards candidate therapeutic approaches. Author summary: PIGA-CDG is a rare neurodevelopmental disorder caused by pathogenic variants in the gene PIGA. Patients primarily display neurological symptoms, including seizures, developmental delay, and intellectual disability. Fewer than 100 patients have been identified, and treatment strategies are limited. In the context of rare diseases, de novo drug development is difficult due to the high cost, lengthy development times, and a patient population that is often too small to support a clinical trial. Our lab leverages drug repurposing screening to circumvent many of the hurdles associated with de novo drug development. Here, we develop and screen FDA- or EMA-approved compounds on a Drosophila model of PIGA-CDG, uncovering novel biology underlying PIGA-associated pathophysiology. We use pharmacological and genetic tools to demonstrate that modifying dopamine signaling and abundance, as well as cyclooxygenase-mediated pathways, contribute to PIGA-associated phenotypes. This work highlights promising therapeutic targets for PIGA-CDG.
Wright, H. I. W.; Darrous, L.; Ferrat, L.; Chundru, V. K.; Kamoun, A.; Wood, A. R.; Wright, C. F.; Patel, K. A.; Frayling, T. M.; Weedon, M. N.; Beaumont, R. N.; Hawkes, G.
Whole genome sequencing in diverse population-scale biobanks offers new insights into the genetic architecture of complex traits from rare and non-coding variants. However, rare single variant and aggregate associations are often confounded by linkage disequilibrium and haplotype structure, resulting in large numbers of false-positive associations. Previous methods that rely on reference panels or linkage disequilibrium matrices to determine conditional independence in meta-analyses do not scale to very rare variants, which may be observed in only one biobank and can exhibit long-range haplotypes. Here, we implement a federated approach to perform iterative conditional meta-analysis on individual-level genotype and phenotype data across biobanks while adhering to data sharing policies. We applied our methodology to a meta-analysis of LDL-C in 614,375 individuals from UK Biobank and All of Us, encompassing six genetic ancestry groups. After conditioning, only 4.3% of significantly associated rare single variants and 6.9% of aggregates remained statistically independent. The proportion of significant aggregates that remained independent after conditioning was higher for coding-based tests than for non-coding tests. We further validate that our approach effectively suppresses false-positive associations using simulations centred on the LDLR locus. We identify allelic series of variants associated with reduced LDL-C, including loss-of-function variants in DNAJC13 and variants in the 3-prime untranslated region of LDLR. Our results highlight that federated conditioning can distinguish independent rare variant signals from linkage and haplotype structure artifacts in multi-ancestry meta-analyses across separate biobanks.
Abderrazzaq, H.; Singh, M.; Babb, L.; Bergquist, T.; Brenner, S. E.; Pejaver, V.; O'Donnell-Luria, A.; Radivojac, P.; ClinGen Computational Working Group, ; ClinGen Variant Classification Working Group,
Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, and of an insertion and deletion separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.
Liu, Z.; Duan, X.; Peymani, F.; Wang, J.; Bao, C.; Xu, C.; Zou, Y.; Zhang, Z.; Zhang, Y.; Li, T.; Pavlov, M.; Wang, J.; Song, M.; Song, T.; Han, X.; Sun, M.; Shen, D.; Duan, R.; Jiang, H.; Xu, M.; Prokisch, H.; Fang, F.
BackgroundMitochondrial diseases are the most common inherited metabolic disorders, characterized by pronounced clinical and genetic heterogeneity that complicates molecular diagnosis. Although DNA-based sequencing approaches have become standard in genetic testing, up to half of patients remain without a definitive diagnosis. RNA sequencing (RNA-seq) provides a complementary layer of evidence by revealing functional consequences of genetic variation, thereby improving diagnostic yield. MethodsWe performed RNA-seq on skin fibroblasts from 140 pediatric patients with suspected mitochondrial disease who remained genetically undiagnosed after whole exome sequencing (WES). Aberrant RNA expression and splicing were identified using the detection of RNA outliers pipeline (DROP). Based on WES findings, patients were stratified into a candidate group (n=28), in which RNA-seq evaluated the pathogenicity of WES-identified variants of uncertain significance and an unsolved group (n=112), in which RNA-seq was used to pinpoint candidate genes. In six cases where RNA-seq identified the aberrant RNA-event but WES did not detect the causative variants, whole genome sequencing (WGS) was performed. ResultsIntegrative RNA-seq, WES, and WGS analysis resulted in a genetic diagnosis in 25% of patients overall (20/28 [71%] in the candidate group; 15/112 [13%] in the unsolved group). Aberrant splicing explained most candidate-group diagnoses, including variants misclassified by in silico predictors such as SpliceAI. Fourteen percent of protein-truncating variants predicted to undergo nonsense-mediated decay (NMD) escaped degradation, highlighting the functional limits of current predictions. The variants identified in the unsolved cohort included synonymous, missense, deep intronic, near-splice-site variants, and large deletions. The most frequent amongst them was a recurrent synonymous East Asian founder mutation in ECHS1, accounting for seven cases. 
Interestingly, across 231 pathogenic variants associated with aberrant RNA phenotypes compiled from this study and prior reports, half were non-coding and half were coding variants. Conclusion: RNA-seq substantially enhances molecular diagnosis in mitochondrial disease by exposing cryptic splicing, regulatory, and NMD-escape events invisible to DNA sequencing alone. These data advocate transcriptome analysis as an essential component of comprehensive genomic diagnostics in neuro-metabolic disease. Significance Statement: Mitochondrial diseases remain among the most challenging inherited metabolic disorders to diagnose, with nearly half of patients unresolved despite advanced DNA sequencing. By integrating transcriptome profiling into the diagnostic workflow, this study demonstrates that RNA sequencing can reveal pathogenic mechanisms invisible to exome or genome analysis, including cryptic splicing, regulatory variants, and transcripts that escape nonsense-mediated decay. The findings establish RNA-seq as a decisive bridge between genotype and phenotype, uncovering functional consequences of genetic variation and redefining molecular diagnostics for mitochondrial and other neuro-metabolic diseases.
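As a quick consistency check on the diagnostic yields quoted in this abstract, the arithmetic can be reproduced from the reported counts alone. The sketch below is illustrative only: the variable and function names are hypothetical, and rounding to the nearest whole percent is an assumption about how the abstract's figures were derived.

```python
# Counts as reported in the abstract (solved cases / group size).
candidate_solved, candidate_total = 20, 28    # candidate group
unsolved_solved, unsolved_total = 15, 112     # unsolved group


def yield_pct(solved: int, total: int) -> int:
    """Diagnostic yield as a percentage, rounded to the nearest integer (assumed rounding)."""
    return round(100 * solved / total)


# Overall yield pools both groups (35 solved out of 140 patients).
overall_solved = candidate_solved + unsolved_solved
overall_total = candidate_total + unsolved_total

candidate_yield = yield_pct(candidate_solved, candidate_total)
unsolved_yield = yield_pct(unsolved_solved, unsolved_total)
overall_yield = yield_pct(overall_solved, overall_total)

print(f"candidate: {candidate_yield}%  unsolved: {unsolved_yield}%  overall: {overall_yield}%")
```

Under these assumptions the group-level counts sum to the stated cohort of 140 patients and are consistent with the reported 71%, 13%, and 25% yields.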
Urbatsch, D.; Jeyaraj, A.; Bedekar, S.; Rao, V.; White, S. C.; Thomas, M. J.; Garrod, A.; Peroutka, C.; Ratan, A.; Kulkarni, S. S.
Defects in motile cilia cause a range of disorders, including heterotaxy (HTX), congenital heart disease (CHD), and primary ciliary dyskinesia (PCD). Although these conditions often co-occur, the genetic and mechanistic bases for tissue-specific manifestations remain poorly understood. Here, we identify compound heterozygous variants in DAW1, a dynein arm assembly factor, in a proband with HTX and complex congenital heart disease but no clinical signs of PCD. Whole-genome sequencing revealed a maternally inherited canonical splice-site variant (c.648+1G>A) and a paternally inherited missense variant (c.341G>A; p.Arg114Gln), both classified as variants of uncertain significance under ACMG/AMP guidelines. Using Xenopus tropicalis, we show that Daw1 depletion disrupts left-right patterning, cardiac looping, and mucociliary flow, all of which are rescued by wild-type human DAW1. Functional testing of patient alleles showed notable tissue specificity: p.Arg114Gln fully rescued mucociliary flow but did not restore left-right patterning, while the splice-site variant resulted in a complete loss of function in both contexts. These findings closely match the proband's clinical phenotype and provide strong functional evidence to support reclassifying c.648+1G>A as pathogenic and p.Arg114Gln as a context-dependent hypomorphic allele. This study establishes functional criteria for interpreting DAW1 variants, shows how developmental context clarifies genotype-phenotype relationships, and highlights how in vivo models can support ACMG reclassification of unresolved HTX-related variants.
Anderson, Z. B.; Prall, T.; Damaraju, N.; Storz, S. H.; Goffena, J.; Miller, A. L.; Carroll, J.; Neitz, M.; Miller, D. E.
The human opsin gene cluster at Xq28 contains highly similar OPN1LW and OPN1MW genes essential for red-green color vision. Current molecular methods cannot accurately analyze this complex locus, limiting diagnosis of color vision deficiencies (CVD) and detection of carrier status. We performed Nanopore long-read sequencing of 206 individuals, comparing alignment-based analysis with targeted de novo assembly. Alignment-based methods performed poorly, whereas targeted assembly achieved 99% concordance for OPN1LW and 92% for OPN1MW copy numbers and resolved gene order in all XY individuals and 87% of XX individuals. This approach detected CVD in 3.2% of XY individuals and identified 8% of XX individuals as carriers, consistent with population estimates. Moreover, it molecularly explained the phenotypic severity in a family with Bornholm eye disease and clarified carrier status in an XX individual suspected of carrying two CVD haplotypes. Our approach provides a comprehensive, reference-free method for accurate analysis of expressed opsin genes and reliable CVD carrier detection.